Consistency and Replication

Back to ece454

Distribution:

  1. Partitioning
  2. Replication

Replication

Why?

Examples

  1. GFS and HDFS
  2. Map-reduce
  3. MySQL server

Storage system behaviour

Sequential Consistency

This example is consistent:

Process time
P1 w(x)a
P2 w(x)b
P3 r(x)b r(x)a
P4 r(x)b r(x)a

If two writes occur in two different processes, they do not necessarily fall in temporal order. This only applies to writes in the same process.

There must be a sequential ordering of the events for it to be sequentially consistent.

This example is not:

Process time
P1 w(x)a
P2 w(x)b
P3 r(x)b r(x)a
P4 r(x)a r(x)b

Causal consistency

Process time
P1 w(x)a w(x)c
P2 r(x)a w(x)b
P3 r(x)a r(x)c r(x)b
P4 r(x)a r(x)b r(x)c

Has relationships "follows in program order" and "reads from", both examples of a causal precedence. This is causally consistent.

This is not:

Process time
P1 w(x)a
P2 r(x)a w(x)b
P3 r(x)b r(x)a
P4 r(x)a r(x)b

However, removing a read makes it consistent again:

Process time
P1 w(x)a
P2 w(x)b
P3 r(x)b r(x)a
P4 r(x)a r(x)b

Linearizability

Arrows represent the happens before relation.

Potential orderings:

Since there is a valid total order, the example is linearizable.

Alternative interpretation

This example is not linearizable:

Linearizability check algorithm

Another:

Eventual consistency

This allows different processes to observe write operations taking effect in different orders, even when these write operations are related by "causally precedes" or "happens before"

e.g. This does not have eventual consistency

  1. P1.w(x)a
  2. P2.w(x)b
  3. P3.r(x)a
  4. P4.r(x)b
  5. Repeat steps 3 and 4 forever.

Properties

Guarantees: Monotonicity

Every time you do a read, it should either be the same value as before, or a newer version

Negative example:

Process time
P1 w(x)a
P2 w(x)b
P3 r(x)a r(x)b
P4 r(x)b r(x)a

Positive example:

Process time
P1 w(x)a
P2 w(x)b
P3 r(x)a r(x)b
P4 r(x)b r(x)a

Guarantees: Read your own writes

If you write a value and then do a read, you should get the value you just wrote, not an older one

Protocols

Primary backup

A server has the true copy

Quorum

There is a maximum number of servers \(f\) that are allowed to fail, and a value is considered written after a majority have written.

Let \(N\), \(N_R\), and \(N_W\) be total number of replicas for a data object \(x\), the size of the read quorum, and the size of the write quorum.

In distributed databases, RW quorums must satisfy:

  1. \(N_R + N_W \gt N\) (read and write quorums overlap)
  2. \(N_W + N_W \gt N\) (two write quorums overlap)

Partial quorums

Partial quorums lack overlap

  1. \(N_R + N_W \gt N\) (strong consistency)
  2. \(N_R + N_W \le N\) (weak consistency)

Use last-timestamp-wins to deal with conflicts

Fault Tolerance

Dependability

Fault types

Masking failures

Consensus Problem

Zookeeper

Coordination

Events

Znodes

Fault-tolerant RPCs

Semantics under failures:

  1. client is unable to locate the server
  2. request message from the client to server is lost
  3. server crashes after receiving a request
  4. reply message from server to client is lost
  5. client crashes after sending a request

Primary backup

Update:

  1. Client sends to primary
  2. Primary locks exclusive lock
  3. Primary sends to backup
  4. Backup sends acknowledgement to primary
  5. Primary unlocks
  6. Primary sends an acknowledgement to client

Read:

  1. client sends to primary
  2. Primary locks shared lock
  3. Primary gets data
  4. Primary unlocks
  5. Primary sends data to client

Quorum (ACID)

  1. Client to server to send data and acquire locks
  2. Server sends data to client
  3. Client to server to release locks
  4. Server sends acknowledgement

To get linearizability:

Quorum (NoSQL key-value store)

Anti Entropy

Entropy is when servers have different values for the same key.

Solution: use a Merkle tree, hashing the hashes of its child node values

Checkpoints

2-phase commit is the A in ACID. 2-phase locking is the I in ACID.

A distributed checkpoint is a collection of checkpoints (one per process).

The recovery line is the most recent distributed snapshot.

Coordinated checkpointing algorithm